Abstract

In this study, we calculated the codon usage bias in severe acute respiratory syndrome Coronavirus (SARSCoV) and performed a comparative
analysis of synonymous codon usage patterns in SARSCoV and 10 other evolutionarily related viruses in the order Nidovirales. Although
codon usage bias varies significantly among SARSCoV genes, the overall bias in SARSCoV is slight and is mainly determined by the base
composition at the third codon position. By comparing synonymous codon usage patterns in different viruses, we observed that the
synonymous codon usage pattern in these virus genes is virus specific and phylogenetically conserved, but not host specific.
Phylogenetic analysis based on codon usage patterns suggested that SARSCoV diverged far from all three known groups of Coronavirus.
Compositional constraints could explain most of the variation in synonymous codon usage among these virus genes, while gene function
is also correlated with synonymous codon usage to a certain extent. However, translational selection and gene length have no effect
on the variation of synonymous codon usage in these virus genes.

© 2004 Elsevier B.V. All rights reserved. 
1. Introduction

Synonymous codons are not used equally, either within or between genomes (Grantham et al., 1980; Martin et al.,
1989; Lloyd and Sharp, 1992). Compositional constraints and natural selection are thought to be the two main
factors accounting for codon usage variation among genes in different organisms (Karlin and Mrazek, 1996; Sharp
et al., 1986; Lesnik et al., 2000). The diverse patterns of codon usage in mammals may arise from compositional
constraints of the genomes (Karlin and Mrazek, 1996; Francino and Ochman, 1999; Majumdar et al., 1999; Ghosh
et al., 2000). In contrast, in some unicellular organisms, such as Escherichia coli and Saccharomyces cerevisiae,
highly expressed genes have a strong selective preference for codons with a high concentration of the
corresponding acceptor tRNA molecule, whereas lowly expressed genes display a more uniform pattern of codon
usage (Gouy and Gautier, 1982; Grantham et al., 1981; Ikemura, 1981, 1985; Sharp et al., 1986; Lesnik et al.,
2000). Moreover, mutational pressure rather than translational selection is the most important determinant of
codon bias in some human RNA viruses (Levin and Whittome, 2000; Jenkins et al., 2001; Jenkins and Holmes,
2003). Furthermore, replicational and transcriptional selection is responsible for the codon usage variation
among the genes of Borrelia burgdorferi (McInerney, 1998). In other studies, codon usage was also found to be
related to gene function (Chiapello et al., 1998; Epstein et al., 2000; Ma et al., 2002), protein secondary
structure (Chiusano et al., 1999, 2000; Oresic and Shalloway, 1998; Xie and Ding, 1998; Gupta et al., 2000),
cellular location of gene products (Chiapello et al., 1999) and gene length (Coghlan and Wolfe, 2000; Marais
and Duret, 2001; Moriyama and Powell, 1998).

Severe acute respiratory syndrome (SARS) is a respiratory disease that was recently reported in Asia, North
America and Europe (Chan-Yeung and Yu, 2003; Drazen, 2003). Although the genome sequence of severe acute
respiratory syndrome Coronavirus (SARSCoV) has been published and many studies have been performed on SARSCoV
in recent months (Paul et al., 2003; Qin et al., 2003; Marra et al., 2003; Snijder et al., 2003), little
genomic analysis is available for this virus. Codon usage data of SARSCoV might give some clues to the features
of the SARSCoV genome and some evolutionary information about this virus. Here, we analyzed the codon usage
data of this virus and other viruses in the order Nidovirales. The key evolutionary determinants of codon usage
bias in these viruses were also investigated.

Table 1
Identified ORFs (length > 150 bp) in the SARSCoV (TOR2 isolate) genome

Gene product                         L (a)    ENC      GC3S (%)   $f'_1$ (b)
Putative orf1ab polyprotein          21222    48.47    32.20      -0.60
Orf1a polyprotein                    13149    48.24    33.10      -0.57
Putative spike glycoprotein           3468    45.73    28.30      -0.85
Putative uncharacterized protein       825    47.66    34.50      -0.37
Putative uncharacterized protein       465    42.80    45.10       1.34
Putative small envelope protein E      231    59.06    38.70       0.34
Putative protein M                     666    59.04    42.50       0.51
Putative uncharacterized protein       192    42.19    28.80      -1.08
Putative uncharacterized protein       269    43.05    30.60      -0.55
Putative nucleocapsid protein         1269    54.16    37.60       0.49
Putative uncharacterized protein       297    46.62    58.10       1.87

(a) L represents the length of the identified ORF.
(b) $f'_1$ represents the first axis value of each gene in CA.

2. Materials and methods 

2.1. Materials

SARSCoV is a large, enveloped, positive-stranded RNA virus, which belongs to the order Nidovirales, family
Coronaviridae, genus Coronavirus in virus taxonomy (Marra et al., 2003). The complete genome and coding
sequences of the SARSCoV TOR2 isolate were obtained from GenBank (Version 134.0). Because codon usage
estimates from very short sequences are statistically unreliable, only sequences longer than 150 bp were
analyzed (Table 1). To compare the codon usage pattern among different viruses, coding genes of 10 other
viruses belonging to the order Nidovirales (six viruses in the genus Coronavirus, four viruses in the genus
Arterivirus) were also extracted from GenBank (Version 134.0) (Table 2).

Table 2
Phylogenetic breakdown, accession number, GC3S and the first two axis values in CA of 11 selected viruses
in the order Nidovirales

Organism (a)     Accession number   GC3S (%)   $f'_1$ (b)   $f'_2$ (b)
Coronavirus
  HCoV 229E      NC_002645          30.89      -0.84        -0.16
  PEDV           NC_003436          37.32      -0.04         0.42
  TGV            NC_002306          27.02      -0.99        -0.08
  BCoV           NC_003045          29.43      -0.75         0.48
  MHV            NC_001846          38.30      -0.16         0.27
  AIBV           NC_001451          26.09      -0.90        -1.30
  SARSCoV        NC_004718          37.23       0.05         0.36
Arterivirus
  EAV            NC_002532          47.28       0.80         0.47
  LDEV           NC_002534          45.18       0.53         0.43
  PRRSV          NC_001961          53.76       1.31         0.55
  SHFV           NC_003092          48.43       1.09        -0.14

(a) Organism abbreviations: HCoV 229E, human Coronavirus 229E; PEDV, porcine epidemic diarrhea virus; TGV,
transmissible gastroenteritis virus; BCoV, bovine Coronavirus; MHV, murine hepatitis virus; AIBV, avian
infectious bronchitis virus; SARSCoV, SARS Coronavirus; EAV, equine arteritis virus; LDEV, lactate
dehydrogenase elevating virus; PRRSV, porcine reproductive and respiratory syndrome virus; SHFV, simian
hemorrhagic fever virus.
(b) $f'_1$ and $f'_2$, respectively, represent the first axis mean value and the second axis mean value in
CA of each genome.

2.2. Methods

2.2.1. Synonymous codon usage measures (RSCU)

Relative synonymous codon usage (RSCU) values of each codon in a gene were used to examine synonymous codon
usage without the confounding influence of amino acid composition (Sharp and Li, 1986). N3S, the frequency of
base N at synonymous third codon positions, was also used to calculate the extent of base composition bias.
Additionally, the effective number of codons of a gene (ENC) was used to quantify the codon usage bias of a
gene (Wright, 1990); it is the best overall estimator of absolute synonymous codon usage bias (Comeron and
Aguade, 1998). ENC values range from 20 (when only one codon is used per amino acid) to 61 (when all
synonymous codons are equally used for each amino acid).
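
The RSCU and N3S definitions above can be illustrated with a short Python sketch. It is not the C++ program used in this study: the function names (count_codons, rscu, gc3s) are hypothetical, the standard genetic code is assumed, and GC3S is taken here as the G + C fraction at the third position of codons that have at least one synonym (AUG, UGG and stop codons are excluded).

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in UCAG order; '*' marks stop codons.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_codons(seq):
    """Count sense codons in an in-frame coding sequence (DNA or RNA letters)."""
    seq = seq.upper().replace("T", "U")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = (seq[i:i + 3] for i in range(0, usable, 3))
    # Stop codons and codons with ambiguous letters are dropped.
    return Counter(c for c in codons if CODON_TABLE.get(c, "*") != "*")


def rscu(counts):
    """RSCU = observed count / mean count over the codon's synonymous family."""
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)
    values = {}
    for family in families.values():
        total = sum(counts.get(c, 0) for c in family)
        if total == 0:
            continue  # amino acid absent from the gene: RSCU is undefined
        for c in family:
            values[c] = counts.get(c, 0) * len(family) / total
    return values


def gc3s(counts):
    """Fraction of G or C at the third position of codons that have synonyms."""
    gc = total = 0
    for codon, n in counts.items():
        if CODON_TABLE.get(codon, "*") in "*MW":
            continue  # AUG (Met), UGG (Trp) and stops have no synonymous choice
        total += n
        if codon[2] in "GC":
            gc += n
    return gc / total if total else float("nan")


# Alanine codon counts taken from Table 3.
ala = {"GCU": 531, "GCC": 147, "GCA": 288, "GCG": 55}
print({codon: round(value, 2) for codon, value in rscu(ala).items()})
```

Applied to the four alanine codon counts in Table 3, rscu gives values that round to 2.08, 0.58, 1.13 and 0.22, in agreement with the tabulated RSCU values.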

2.2.2. Correspondence analysis (CA) 

Correspondence analysis was used to investigate the major trend in codon usage variation among genes. Each
gene is represented as a 59-dimensional vector, and each dimension corresponds to the RSCU value of one sense
codon (excluding AUG, UGG and the three stop codons).

CA based on RSCU values relies on two main steps (Mardia et al., 1979). The first step is to measure the
similarities in codon usage using the squared Euclidean distance among all genes; the resulting distance table
is used to compute the coordinates of the genes in a multidimensional space. The second step provides the
visualization of these Euclidean distances by positioning genes through successive orthogonal projections of
the cloud of points. Essentially, this process consists in finding the linear transformations
$f'_1, f'_2, \ldots, f'_{58}$ of the original variables $f_1, f_2, \ldots, f_{59}$. The $f'$-variables are
calculated and ordered according to their relative variance. $f'_1$ has the maximum value; $f'_2$ has the next
value and is by construction not correlated with $f'_1$. The same applies to $f'_3$, $f'_4$ and so on, until
$f'_{58}$. So, genes with similar codon usage are neighbors on the components of projection.
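
The projection step can be sketched in Python as a principal-axis decomposition of the gene-by-codon RSCU matrix. This is consistent with the Euclidean description above, but it is not necessarily the algorithm applied by SPSS. The function name first_axis, the use of power iteration and the toy data are assumptions made for illustration, and only the first axis is extracted.

```python
import math
import random


def first_axis(rows, iterations=200, seed=0):
    """Project rows (genes x RSCU values) on the axis of maximum variance.

    Returns (scores, share): one coordinate per gene, and the fraction of
    the total variance carried by this axis.
    """
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Covariance matrix of the codon columns.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(m)]
           for i in range(m)]
    total = sum(cov[j][j] for j in range(m))
    if total == 0:
        raise ValueError("all genes have identical codon usage")
    # Random start: a constant vector would be wrong here, because RSCU values
    # sum to a constant within each amino acid family.
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(m)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cov_v = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
    eigenvalue = sum(a * b for a, b in zip(v, cov_v))
    scores = [sum(c[j] * v[j] for j in range(m)) for c in centered]
    return scores, eigenvalue / total


# Toy example: three hypothetical genes described by two two-codon families.
toy = [[1.8, 0.2, 1.6, 0.4],
       [1.0, 1.0, 1.0, 1.0],
       [0.2, 1.8, 0.4, 1.6]]
scores, share = first_axis(toy)
print(scores, share)
```

The sign of an axis is arbitrary, so only the relative positions of genes along it are meaningful.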
2.2.3. Statistical methods

Linear regression analysis was used to find the correlation between codon usage bias and nucleotide
composition. A one-tailed t-test was used to compare the variation of codon usage between different gene
groups (Ewens and Grant, 2001). Under the null hypothesis, the mean values of the codon usage indices in the
different gene groups are assumed to be statistically the same. The t-statistic is calculated under this null
hypothesis, the P-value is then derived, and the difference is considered significant when the P-value is
below 0.05.

A C++ program was developed to calculate the codon usage indices for each gene. CA and other statistical
analyses were performed with the statistical software SPSS 11.0.
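
The two-sample comparison described above can be sketched as follows. The text does not state whether variances were pooled, so the pooled-variance (Student) form used here is an assumption, as are the function name pooled_t and the ENC values in the example, which are invented for illustration only.

```python
import math


def pooled_t(a, b):
    """Student's two-sample t statistic and degrees of freedom (pooled variance)."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    df = na + nb - 2
    pooled_var = (ss_a + ss_b) / df
    t = (mean_a - mean_b) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df


# Hypothetical ENC values for two gene groups (not data from this study).
group_a = [48.1, 49.5, 50.2, 47.9]
group_b = [54.0, 55.8, 53.1, 56.4]
t, df = pooled_t(group_a, group_b)
print(t, df)
```

A one-tailed P-value is then the tail area of Student's t distribution with df degrees of freedom beyond the observed t, which any statistics package can supply.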

3. Results 

3.1. Synonymous codon usage in SARSCoV 

The details of the coding genes in SARSCoV and the overall RSCU values of the 61 sense codons in SARSCoV are
shown in Tables 1 and 3, respectively. All preferentially used codons in SARSCoV are A-ended or U-ended
codons (Table 3). SARSCoV is a GC-poor genome with a GC content of 37.52%. Due to compositional constraints,
A-ended and/or U-ended codons are expected to be preferentially used in this genome. To study the codon usage
variation among different SARSCoV genes, the ENC and GC3S values of the SARSCoV genes were calculated
(Table 1). ENC values of the SARSCoV genes vary from 42.19 to 59.06, with a mean value of 48.99 and S.D. of
6.41. Because all ENC values of SARSCoV genes are high (ENC > 40), codon usage bias in the SARSCoV genome is
slight. However, there is a marked variation in codon usage pattern among different SARSCoV genes
(S.D. = 6.41). Similarly, the GC3S values of the SARSCoV genes, which range from 28.3 to 58.1% with a mean of
37.23% and S.D. of 8.78%, also confirm the heterogeneity of synonymous codon usage among different SARSCoV
genes.
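
For reference, ENC values such as those in Table 1 can be computed along the following lines. This sketch follows the formulation of Wright (1990) as we understand it: a codon homozygosity F is computed for each amino acid, averaged within each degeneracy class, and combined as ENC = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6. The names are hypothetical, and the handling of sparse genes (a fallback when isoleucine is absent, an error otherwise) is a simplifying assumption.

```python
from itertools import product

# Standard genetic code, codons enumerated in UCAG order; '*' marks stop codons.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
FAMILIES = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO):
    if aa != "*":
        FAMILIES.setdefault(aa, []).append("".join(codon))


def enc(counts):
    """Effective number of codons from a {codon: count} mapping (RNA codons)."""
    f_by_class = {}
    for family in FAMILIES.values():
        k = len(family)
        n = sum(counts.get(c, 0) for c in family)
        if k == 1 or n < 2:
            continue  # Met/Trp are uninformative; F needs at least two codons
        homozygosity = sum((counts.get(c, 0) / n) ** 2 for c in family)
        f_by_class.setdefault(k, []).append((n * homozygosity - 1) / (n - 1))
    mean_f = {k: sum(v) / len(v) for k, v in f_by_class.items()}
    if 3 not in mean_f and 2 in mean_f and 4 in mean_f:
        mean_f[3] = (mean_f[2] + mean_f[4]) / 2  # fallback when Ile is absent
    weights = {2: 9, 3: 1, 4: 5, 6: 3}  # number of amino acids in each class
    if any(mean_f.get(k, 0) <= 0 for k in weights):
        raise ValueError("too few codons to estimate ENC for this gene")
    nc = 2 + sum(w / mean_f[k] for k, w in weights.items())
    return min(nc, 61.0)  # estimates above 61 are set to 61


# Extreme case: a single codon used for every amino acid.
one_codon_each = {family[0]: 10 for family in FAMILIES.values()}
print(enc(one_codon_each))
```

The two limiting cases behave as described in Section 2.2.1: exclusive use of one codon per amino acid gives 20, and equal use of all synonymous codons gives 61.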

Table 3
Synonymous codon usage in SARSCoV (c)

AA (a)   Codon   RSCU    N (b)      AA (a)   Codon   RSCU    N (b)
Ala      GCU     2.08    531        Ile      AUU     1.72    410
         GCC     0.58    147                 AUC     0.67    159
         GCA     1.13    288                 AUA     0.62    148
         GCG     0.22     55        Cys      UGU     1.27    280
Gly      GGG     0.17     37                 UGC     0.73    160
         GGA     0.85    182        Thr      ACU     1.66    427
         GGC     0.95    202                 ACC     0.59    153
         GGU     2.02    431                 ACG     0.18     46
Val      GUU     1.71    479                 ACA     1.57    406
         GUC     0.67    188        Asn      AAU     1.24    449
         GUA     0.83    232                 AAC     0.76    277
         GUG     0.78    219        Gln      CAA     1.16    298
Leu      UUA     1.04    238                 CAG     0.84    214
         UUG     1.10    251        Tyr      UAU     1.12    345
         CUU     1.79    409                 UAC     0.88    270
         CUC     0.83    191        His      CAU     1.29    187
         CUA     0.64    147                 CAC     0.71    103
         CUG     0.60    138        Asp      GAU     1.24    463
Phe      UUC     0.77    260                 GAC     0.76    282
         UUU     1.23    414        Glu      GAA     1.04    354
Pro      CCU     1.74    247                 GAG     0.96    326
         CCC     0.40     57        Lys      AAA     1.04    421
         CCA     1.70    241                 AAG     0.96    388
         CCG     0.16     22        Arg      CGU     1.77    153
Ser      UCU     1.96    310                 CGC     0.72     62
         UCC     0.42     67                 CGA     0.44     38
         UCA     1.70    270                 CGG     0.09      8
         UCG     0.23     36                 AGA     2.08    180
         AGU     1.17    186                 AGG     0.90     78
         AGC     0.52     82

(a) AA is the abbreviation of amino acid.
(b) N represents the number of occurrences of each sense codon.
(c) The preferentially used codons for each amino acid are displayed in bold.

3.2. Synonymous codon usage in different viruses is virus specific, but not host specific

CA was implemented for all identified ORFs from the 11 virus genomes as a single dataset, which consists of
103 coding sequences. CA detected one major trend in the first axis, which accounted for 15.40% of the total
variation; none of the other axes individually accounted for more than 7.60% of the total variation. A plot of
the first axis and the second axis values of each gene is shown in Fig. 1. Although this graph is somewhat
complex, with some overlap among genes from different genomes, it is clear that genes from a particular genome
tend to cluster together. The separation of one virus genome from other virus genomes is significant on both
axes (t-test, P-value < $10^{-15}$ on the first axis and P-value < $10^{-3}$ on the second axis). So, similar
to codon usage in mammals and bacteria, synonymous codon usage in these viruses is also virus specific.

To show whether there is a correlation between virus codon usage and host, these 103 virus genes were divided
into several groups according to the virus host. For example, because both SARSCoV and human Coronavirus 229E
infect humans, the genes of these two viruses were combined into one group. Next, a t-test was used to test
whether the separation of genes from viruses that infect different hosts is significant. The P-value is 0.57
on the first axis and 0.08 on the second axis, which suggests that codon usage in these virus genes is not
host specific.

3.3. Phylogenetic analysis of these viruses based on 
codon usage pattern 

In Fig. 1, all virus genes in the genus Coronavirus were plotted in red, and all viral genes in the genus
Arterivirus were plotted in blue. Coronavirus genes are mainly located on the left side of the plot, while a
majority of Arterivirus genes are located on the right side. The separation of Coronavirus genes and
Arterivirus genes on the first axis is statistically significant (t-test, P-value < $10^{-15}$). Hence,
synonymous codon usage appears to be conserved between phylogenetically related viruses.

SARSCoV genes were also widely spread along the first axis (Fig. 1). Six of the eleven SARSCoV genes were
located in the cluster of Coronavirus genes, while the other five SARSCoV genes were located in the cluster of
Arterivirus genes. Therefore, SARSCoV might have diverged far from all three known Coronavirus groups.
Compared with all other viruses in the genus Coronavirus, it might be more closely related in evolutionary
terms to the genus Arterivirus.

Fig. 1. A plot of the values of the first axis ($f'_1$, 15.40% of the total variation) and the second axis
($f'_2$) of each gene in CA (abbreviations of the viruses: AIBV, avian infectious bronchitis virus; BCoV,
bovine Coronavirus; EAV, equine arteritis virus; HCoV 229E, human Coronavirus 229E; LDEV, lactate
dehydrogenase elevating virus; MHV, murine hepatitis virus; PEDV, porcine epidemic diarrhea virus; PRRSV,
porcine reproductive and respiratory syndrome virus; SARSCoV, SARS Coronavirus; SHFV, simian hemorrhagic fever
virus; TGV, transmissible gastroenteritis virus).

3.4. Mutational bias is the main factor determining the codon usage variation among different viruses

Linear regression analysis was implemented to determine whether synonymous codon usage bias is correlated with
nucleotide composition. The $R^2$ values and significance levels of these regression analyses are listed in
Table 4. The first axis value of each gene in CA is closely correlated with all the base compositions at the
third codon position, while the second axis value of each gene is correlated with some base compositions at
the third codon position to a certain extent. Therefore, compositional constraint mainly determines the
variation of synonymous codon usage among these virus genes.
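
As an illustration of the regression step, the sketch below fits an ordinary least-squares line and reports $R^2$. It uses the genome-level means from Table 2 (GC3S and first axis value) rather than the 103 individual genes, so it is not expected to reproduce the gene-level figures in Table 4. The function name linear_fit is hypothetical.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit of y on x; returns (slope, intercept, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy * sxy / (sxx * syy)
    return slope, intercept, r_squared


# Genome-level values from Table 2, in table order (HCoV 229E ... SHFV).
gc3s_percent = [30.89, 37.32, 27.02, 29.43, 38.30, 26.09, 37.23,
                47.28, 45.18, 53.76, 48.43]
axis1_mean = [-0.84, -0.04, -0.99, -0.75, -0.16, -0.90, 0.05,
              0.80, 0.53, 1.31, 1.09]

slope, intercept, r_squared = linear_fit(gc3s_percent, axis1_mean)
print(slope, intercept, r_squared)
```

With a single predictor, $R^2$ equals the squared Pearson correlation coefficient, which is how it is computed here.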

Furthermore, we plotted the first axis values in CA against the GC3S values of each gene (Fig. 2). The mean
GC3S value of genes in coronaviruses ranges from 26.09 to 37.32%, and it ranges from 45.18 to 53.76% in
arteriviruses (Table 2). Although codon usage bias appears to be conserved between evolutionarily related
viruses (Section 3.3), the patterns of codon usage in different virus genes also appear to be a direct
function of the GC content at the third codon position of these genes.

Table 4
Summary of linear regression analysis between the first two axes in CA and the nucleotide contents at the
third codon position in all selected virus genes (a)

Base composition   $f'_1$ (b)    $f'_2$ (b)
A3S                0.791****     0.085*
T3S                0.239****     0.444****
G3S                0.484****     0.082*
C3S                0.720****     0.0001 NS
GC3S               0.936****     0.018 NS

NS: non-significant.
(a) Values in this table are the $R^2$ values of each linear regression analysis.
(b) $f'_1$ and $f'_2$, respectively, represent the values of the first and the second axis of each gene in CA.
* P-value < 0.01.
**** P-value < 0.00001.

Fig. 2. A dot plot of the first axis value in correspondence analysis and GC3S of each gene, with Coronavirus
and Arterivirus genes marked separately ($f'_1$ denotes the first axis value in correspondence analysis of
each gene, and GC3S denotes the G + C content at the third synonymous codon position of each gene).

3.5. Gene function also drives the codon usage variation among different viruses

The plot of ENC against GC3S is another effective way to explore codon usage variation among genes (Wright,
1990). ENC values of each virus gene were plotted against the corresponding GC3S values (Fig. 3). The solid
line represents the expected curve if codon usage is determined only by the GC content at the third codon
position. A large proportion of points lie near the solid line in the left region of this distribution. This
also suggests that mutational bias is the main factor determining the codon usage variation among these genes.
However, some points lie below the expected curve. Hence, besides mutational bias, there might be additional
factors driving the codon usage variation among these genes.
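
The expected curve is commonly written, following Wright (1990), as ENC = 2 + s + 29 / (s^2 + (1 - s)^2), where s is GC3S expressed as a fraction. We assume this is the form of the solid line in Fig. 3. The helper names below are hypothetical, and the example pairs are three SARSCoV genes from Table 1.

```python
def expected_enc(gc3s):
    """Expected ENC when codon choice depends only on GC3S (as a fraction, 0 to 1)."""
    s = gc3s
    return 2 + s + 29 / (s * s + (1 - s) ** 2)


def deviation_from_expected(enc_value, gc3s_percent):
    """Observed minus expected ENC; negative values lie below the curve."""
    return enc_value - expected_enc(gc3s_percent / 100)


# (gene, GC3S in %, ENC) for three SARSCoV genes listed in Table 1.
examples = [
    ("spike glycoprotein", 28.30, 45.73),
    ("small envelope protein E", 38.70, 59.06),
    ("nucleocapsid protein", 37.60, 54.16),
]
for name, gc3s_percent, enc_value in examples:
    print(name, round(deviation_from_expected(enc_value, gc3s_percent), 2))
```

The curve peaks at GC3S = 0.5, where it reaches 60.5, and falls toward either extreme of base composition, so genes with strongly skewed GC3S are expected to show a low ENC even without selection.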

Fig. 3. ENC vs. GC3S plot of all virus genes (ENC denotes the effective number of codons of each gene, and
GC3S denotes the G + C content at the third synonymous codon position of each gene. The solid line represents
the relationship between GC3S and ENC under the assumption of random codon usage).

To show whether translational selection or gene function is correlated with the observed variation in codon
bias, all virus genes were grouped into several classes according to gene function. Because most of these
viruses contain genes coding for RNA polymerase, envelope protein and structural glycoprotein, these three
gene groups were selected to test whether codon usage is correlated with gene function. A one-tailed t-test
was then performed on the ENC values of these genes under the null hypothesis that there is no correlation
between codon usage bias and gene function. Some associations were found. Average codon usage bias is higher
in the RNA polymerase gene group than in the envelope gene group (t-test, P-value = 0.031), and it is higher
in the RNA polymerase gene group than in the structural glycoprotein gene group (t-test, P-value = 0.002).
However, there is no significant difference in codon usage between the structural glycoprotein gene group and
the envelope protein gene group (t-test, P-value = 0.74). Because the structural glycoprotein and the envelope
protein are both structural proteins in these viruses and RNA polymerase is a nonstructural protein, codon
usage in structural genes is significantly diverged from that in nonstructural genes. On the other hand,
structural genes are generally more highly expressed than nonstructural genes. So, if translational selection
also contributed to codon usage bias in these genes, codon usage bias in structural genes should be higher
than in RNA polymerase genes. However, RNA polymerase genes (ENC = 49.25) were found to have greater codon
usage bias than structural genes (ENC = 54.60 for envelope genes and ENC = 55.33 for structural glycoprotein
genes); a lower ENC indicates a stronger bias. Hence, codon usage bias in these virus genes is not related to
gene expression level. Furthermore, we also performed a linear regression analysis of ENC value against gene
length. There was no significant correlation between codon usage and gene length in these virus genes
(P-value > 0.05). So, gene function, rather than translational selection and gene length, is another factor
accounting for codon usage variation among these virus genes.

4. Discussion 

Our analysis revealed that synonymous codon usage in SARSCoV is only weakly biased and that this bias is
mainly determined by the base composition at the third codon position. Comparative analysis of codon usage
bias in the order Nidovirales also suggested that codon usage in these viruses is virus specific and that
mutational bias is the main factor driving the codon usage variation among these viruses. Gene function is
also related to codon usage bias in these viruses to some extent. However, translational selection and gene
length might have no effect on the codon usage pattern in these viruses. Published results have shown that the
overall extent of codon usage bias in RNA viruses is low and that there is little variation in bias between
genes (Levin and Whittome, 2000; Jenkins et al., 2001; Jenkins and Holmes, 2003). Although SARSCoV is a newly
detected RNA virus infecting humans, the synonymous codon usage pattern in SARSCoV described here is in
accordance with the published codon usage patterns of human RNA viruses (Jenkins and Holmes, 2003). Because
mutation rates in RNA viruses are much higher than those in DNA viruses (Drake and Holland, 1999), it is
understandable that mutation pressure is the main determinant of codon usage bias in SARSCoV. Our analysis
also revealed that there is no host specific codon usage pattern in these viruses. So, the host genome might
have no obvious effect on the evolution of these viruses.

Some phylogenetic analyses of SARSCoV (Qin et al., 2003; Marra et al., 2003) have shown that SARSCoV does not
closely resemble any of the three previously known groups in the genus Coronavirus. However, Snijder et al.
(2003) proposed that SARSCoV is most closely related to group 2 coronaviruses. Based on the different codon
usage patterns in different coronaviruses, we found that the codon usage pattern of each virus is
phylogenetically distinct and that SARSCoV might have diverged far from all three known Coronavirus groups,
which is in accordance with the results proposed by Qin et al. (2003) and Marra et al. (2003).

The codon usage patterns and the phylogenetic results proposed here are useful for understanding the
processes governing the evolution of SARSCoV, especially the roles played by mutation pressure and natural
selection. Further, such information might be helpful in understanding the pathogenesis and the origin of
SARSCoV.